ggplot2 is a declarative plotting library in R. The gg in ggplot2 stands for “Grammar of Graphics”, a book on the principles of data visualization that the package is based on. Please keep this reference website open throughout the call. We will be using the tidyverse and gapminder packages. Make sure that you create a chunk at the top of your notebook to load these two packages.
There are seven different layers that are used for creating plots in ggplot2. We need to specify three of these seven layers to create any chart (the rest adopt default values if not specified): the data, the aesthetics, and the geometries (geoms).
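As a minimal sketch of how the three required layers fit together, the example below builds a scatter plot from a tiny made-up data frame. The `toy` object and its values are illustrative assumptions and are not part of the tutorial's data.

```r
library(ggplot2)

# Hypothetical toy data, invented purely for illustration
toy <- data.frame(
  x = c(1, 2, 3, 4),
  y = c(2, 4, 3, 5)
)

# Layer 1: the data; layer 2: the aesthetic mappings; layer 3: the geometry
p <- ggplot(data = toy, mapping = aes(x = x, y = y)) +
  geom_point()

p
```

Everything else (scales, coordinates, theme and so on) falls back to its default here because it is not specified.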
With these three layers in mind, let's use the gapminder dataset to draw a few common plot types.
gapminder
Let's explore the relationship between year and life expectancy using a scatter plot. The first task is to create the data and aesthetics layers, after which we specify the geom we are interested in plotting. As can be seen below, this is not a particularly useful chart. Each year in the data has several observations for life expectancy (one for each country), which results in a chart in which it is difficult to perceive a clear trend. In the next step we will see how to improve this chart.
ggplot(data = gapminder, aes(x = year, y = lifeExp)) +
  geom_point() +
  labs(x = "Year", y = "Life Expectancy")
⚡Ninja Tasks⚡
🏆Solution🏆
The gapminder data consists of one observation for each country for each year. We are interested in capturing the relationship between GDP per capita and life expectancy. However, since the data is a time series, we need to average the GDP per capita and life expectancy for each country over the entire time series before plotting. This smooths out time trends and avoids over-plotting.[1] Another way to do this would be to filter down to the most recent year and study the relationship there. Both are valid options; however, the former has the benefit of capturing more information, since it uses the entire time series to calculate the mean values, while the latter disregards all but one year of the data.
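The second option mentioned above, keeping only the most recent year, could look roughly like the sketch below. The object names `latest` and `p_latest` are illustrative, and the code assumes the dplyr, ggplot2 and gapminder packages are installed.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

# Keep only the most recent year instead of averaging over all years
latest <- gapminder %>%
  filter(year == max(year))

# One point per country, so over-plotting is avoided here as well
p_latest <- ggplot(latest, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  labs(x = "GDP per capita", y = "Life Expectancy")

p_latest
```

This keeps one row per country, just like the averaging approach, but it ignores every earlier year in the data.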
In addition, since GDP per capita contains some very large values, we can plot the log of the mean GDP per capita to compress them (watch the video below if you don’t fully understand logarithmic scales yet). We can do this in two ways. First, we could mutate meanGDPperCap by taking its log and plot that (try this yourself).[2] Second, we could use ggplot2's built-in scale_x_log10() function to transform the plotting scale instead of the variable. While the charts would largely look the same, the second option preserves the actual variable and plots it on the new scale, i.e. we are still plotting the mean GDP per capita itself, only on a log scale.
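The first option described above, transforming the variable itself, might be sketched as follows. The column name `logMeanGDPperCap` and the choice of a base-10 log (to mirror scale_x_log10()) are assumptions made for this illustration.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

country_means <- gapminder %>%
  group_by(country) %>%
  summarise(
    meanGDPperCap = mean(gdpPercap, na.rm = TRUE),
    meanLifeExp = mean(lifeExp, na.rm = TRUE)
  ) %>%
  # Option 1: transform the variable, not the scale
  mutate(logMeanGDPperCap = log10(meanGDPperCap))

p_log <- ggplot(country_means, aes(x = logMeanGDPperCap, y = meanLifeExp)) +
  geom_point() +
  labs(x = "Log10 of mean GDP per capita", y = "Mean Life Expectancy")

p_log
```

With this approach the x axis is labelled in log units, so a value of 3 on the axis corresponds to a mean GDP per capita of 1,000. The scale-based approach keeps the axis labels in the original units.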
gapminder %>%
  group_by(country) %>%
  summarise(meanGDPperCap = mean(gdpPercap, na.rm = TRUE), meanLifeExp = mean(lifeExp, na.rm = TRUE)) %>%
  ggplot(., aes(x = meanGDPperCap, y = meanLifeExp)) +
  geom_point(colour = "blue", size = 2.5, alpha = 0.8) +
  scale_x_log10() +
  labs(x = "Mean GDP per capita", y = "Mean Life Expectancy", title = "Life Expectancy increases with GDP per capita")
In the next chart I don’t apply the log transformation, to show the difference from the previous case. Without it we can see that the relationship between life expectancy and GDP per capita is not linear: life expectancy rises steeply at low incomes and flattens out at higher incomes (strictly speaking, this is a logarithmic curve). In addition, I map the size of each point to the log of the meanPop variable. Notice how I had to create this variable in the summarise() command so that I could have access to it in the ggplot() function.
gapminder %>%
  group_by(country) %>%
  summarise(meanGDPperCap = mean(gdpPercap, na.rm = TRUE), meanLifeExp = mean(lifeExp, na.rm = TRUE), meanPop = mean(pop, na.rm = TRUE)) %>%
  ggplot(., aes(x = meanGDPperCap, y = meanLifeExp, size = log(meanPop))) +
  geom_point() +
  labs(x = "Mean GDP per capita", y = "Mean Life Expectancy", title = "Life Expectancy increases logarithmically with GDP per capita", size = "Log of mean population") +
  theme(legend.position = "top")
Now let's use dplyr to calculate the average global life expectancy by year and plot that using a line. As can be seen below, the average global life expectancy has been increasing over the years.
gapminder %>%
  ##group by year to calculate the yearly average life expectancy
  group_by(year) %>%
  summarise(meanLifeExp = mean(lifeExp, na.rm = TRUE)) %>%
  ##draw a line marking the trend of life expectancy over the years
  ggplot(., aes(x = year, y = meanLifeExp)) +
  geom_line() +
  labs(title = "Average life expectancy has been increasing", x = "Years", y = "Mean Life Expectancy")
Now let's go a bit further and explore the trend in life expectancy for different countries. Notice how in the second line I have kept the alpha aesthetic outside the mapping.[3]
ggplot(data = gapminder, mapping = aes(x = year, y = lifeExp)) +
  ##colour and group are mapped to continent and country, and alpha is set to 0.5
  geom_line(mapping = aes(colour = continent, group = country), alpha = 0.5) +
  labs(title = "Countries in Africa have lower life expectancy", x = "Year", y = "Life Expectancy") +
  theme(
    legend.position = "top"
  )
In the chart above there are a few countries where life expectancy drops suddenly. These drops coincide with the genocides in Rwanda and Cambodia and with the Great Famine in China. The chart below uses colour and alpha to highlight these countries. Notice how I use the alphaMapping and colorMapping variables to set the alpha and colour aesthetics in the chart.[4]
##repeat the same but with one line for each continent/country, (guess why there is a sudden drop for a few countries)?
alphaMapping <- if_else(gapminder$country %in% c("Rwanda", "Cambodia", "China"), 0.8, 0.1)
colorMapping <- if_else(gapminder$country %in% c("Rwanda", "Cambodia", "China"), "darkred", "black")
ggplot(data = gapminder, mapping = aes(x = year, y = lifeExp, group = country)) +
  geom_line(alpha = alphaMapping, colour = colorMapping) +
  labs(title = "The impact of genocides on life expectancy", x = "Year", y = "Life Expectancy")
⚡Ninja Tasks⚡
🏆Solution🏆
The chart below shows that poorer countries in Africa and Asia are still lagging behind those in the West. According to Wikipedia, convergence can be described as follows:
The idea of convergence in economics (also sometimes known as the catch-up effect) is the hypothesis that poorer economies’ per capita incomes will tend to grow at faster rates than richer economies. As a result, all economies should eventually converge in terms of per capita income.
gapminder %>%
  group_by(continent, year) %>%
  summarise(meanGDP = mean(gdpPercap, na.rm = TRUE)) %>%
  ggplot(., aes(x = year, y = meanGDP)) +
  geom_line(aes(colour = continent)) +
  labs(x = "Year", y = "Mean GDP per Capita", title = "Convergence where art thou?", colour = "Continent") +
  theme(legend.position = "top")
The histogram below shows the distribution of gdpPercap in the gapminder dataset. There is, however, a problem with this chart. It shows the distribution of GDP per capita pooled over time. We are not interested in the spread of GDP per capita across time; rather, our interest is in seeing the spread across countries. In this case, we might be better off filtering down to a single year and observing the distribution of GDP per capita across countries. Let's try this out with the training exercise.
ggplot(data = gapminder, aes(gdpPercap)) +
  geom_histogram(bins = 30)
⚡Ninja Tasks⚡
🏆Solution🏆
For this chart we select the latest year in the dataset and plot the distribution of the population. Can you think of the pros and cons of using this method versus one in which we calculate the average population for each country over the entire dataset and plot that? [^3]
gapminder %>%
  filter(year == max(year, na.rm = TRUE)) %>%
  ggplot(., aes(pop)) +
  geom_histogram(bins = 60) +
  labs(x = "Population", y = "Count")
The chart below shows the average life expectancy for different continents for the most recent year in the data. Can you think of a reason why it might be slightly better to consider only the most recent year when calculating the average life expectancy for a continent? Also notice how the bars are ordered from the shortest to the tallest. Can you find out how I might have achieved this?
##draw a bar chart with average life expectancy in different continents
gapminder %>%
  filter(year == max(year, na.rm = TRUE)) %>%
  group_by(continent) %>%
  summarise(meanLifeExp = mean(lifeExp, na.rm = TRUE)) %>%
  ggplot(., aes(x = reorder(continent, meanLifeExp), y = meanLifeExp)) +
  geom_col() +
  labs(y = "Mean life expectancy") +
  theme(
    axis.title.x = element_blank()
  )
⚡Ninja Tasks⚡
🏆Solution🏆
gapminder %>%
  group_by(year) %>%
  summarise(meanLifeExp = mean(lifeExp, na.rm = TRUE)) %>%
  ggplot(data = ., aes(x = year, y = meanLifeExp)) +
  geom_col()
⚡Ninja Tasks⚡
⚡Ninja Tasks⚡
##explore gdp per capita over the years using a jitter plot and stat_summary (rule of thumb, always stay as close to the data as possible)
ggplot(gapminder, aes(x = year, y = gdpPercap)) +
  geom_jitter()
##Using vline and hline
⚡Ninja Tasks⚡
##Use the plot from the previous section and add some pizazz (explore ggthemes, legend position etc)
There are 6 different types of joins. These are as follows:
inner_join(): return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combinations of the matches are returned.
left_join(): return all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.
right_join(): return all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.
full_join(): return all rows and all columns from both x and y. Where there are no matching values, returns NA for the one missing.
semi_join(): return all rows from x where there are matching values in y, keeping just columns from x. A semi join differs from an inner join because an inner join will return one row of x for each matching row of y, whereas a semi join will never duplicate rows of x.
anti_join(): return all rows from x where there are no matching values in y, keeping just columns from x.
band_members
band_instruments
inner_join(band_members, band_instruments)
Joining, by = "name"
left_join(band_members, band_instruments)
Joining, by = "name"
right_join(band_members, band_instruments)
Joining, by = "name"
full_join(band_members, band_instruments)
Joining, by = "name"
semi_join(band_members, band_instruments)
Joining, by = "name"
anti_join(band_members, band_instruments)
Joining, by = "name"
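To compare the six joins side by side, the sketch below runs each one with an explicit `by = "name"` argument and counts the rows returned. The `joins` list is an illustrative name. band_members and band_instruments are the example datasets shipped with dplyr.

```r
library(dplyr)

# Run every join type on the same pair of tables, naming the key explicitly
joins <- list(
  inner = inner_join(band_members, band_instruments, by = "name"),
  left  = left_join(band_members, band_instruments, by = "name"),
  right = right_join(band_members, band_instruments, by = "name"),
  full  = full_join(band_members, band_instruments, by = "name"),
  semi  = semi_join(band_members, band_instruments, by = "name"),
  anti  = anti_join(band_members, band_instruments, by = "name")
)

# Number of rows returned by each join
sapply(joins, nrow)
```

Supplying `by` explicitly also suppresses the "Joining, by" message, which can be useful when the key column should be obvious to anyone reading the code.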
left_join() is the most popular type of join (analogous to VLOOKUP in Excel).
This is because alpha is hard coded to 0.5 and is not mapped to any variable in the data, i.e. it has the same value of 0.5 across all the data. In other words, all we have done is change it from its default value of 1 to 0.5.
These aesthetic parameters are not being mapped to the data. Instead we are using vectors that were created outside the current function call to specify the values for these aesthetics.
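The distinction drawn in the notes above, between mapping an aesthetic to a variable inside aes() and setting it to a fixed value outside aes(), can be sketched as follows. The object names `p_mapped` and `p_set` are illustrative.

```r
library(ggplot2)
library(gapminder)

# Mapped: colour varies with a variable in the data, so it goes inside aes()
p_mapped <- ggplot(gapminder, aes(x = year, y = lifeExp, group = country)) +
  geom_line(aes(colour = continent))

# Set: alpha is the same constant for every line, so it goes outside aes()
p_set <- ggplot(gapminder, aes(x = year, y = lifeExp, group = country)) +
  geom_line(alpha = 0.5)

p_mapped
p_set
```

Only the mapped version produces a legend, because a legend describes how data values translate into visual properties.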